
Enhance parallel foreign scan support#1571

Closed
MisterRaindrop wants to merge 1 commit into apache:main from MisterRaindrop:fdw_parallel_support

Conversation

Contributor

@MisterRaindrop MisterRaindrop commented Feb 10, 2026

Support parallel foreign scans and add a new mock FDW for testing them.

Key changes include:

  • Implementation of parallel_foreign_scan_test_fdw to generate synthetic rows and support parallel scanning.
  • Modifications to the optimizer to generate gather paths for foreign tables with parallel capabilities.
  • Updates to execMain.c to enable parallel mode for gather nodes based on the execution context.
  • Addition of test cases to validate the functionality of the new FDW in both coordinator and all-segments modes.


Type of Change

  • Bug fix (non-breaking change)
  • New feature (non-breaking change)
  • Breaking change (fix or feature with breaking changes)
  • Documentation update


Test Plan

  • Unit tests added/updated
  • Integration tests added/updated
  • Passed make installcheck
  • Passed make -C src/test installcheck-cbdb-parallel




@avamingli avamingli left a comment


Overall, FDW parallel scan is a direction worth exploring, but this approach is too rough. The core problems are:

  1. locus transition semantics for Gather in an MPP context haven't been thought through, and the changes are too broad.

  2. FDW is a black box from the database's perspective.
    For heap tables we have parallel scan (divide work by pages), for AO/AOCS we have parallel scan (divide work by files) — the work partitioning is well-defined.
    But for FDWs, the parallel behavior depends entirely on the FDW's own implementation. If an FDW (say file_fdw) sets parallel_safe = true following planner's parallel logic but doesn't actually implement the DSM parallel callbacks (EstimateDSMForeignScan, InitializeDSMForeignScan, InitializeWorkerForeignScan), then multiple workers will each scan the full dataset, producing duplicate rows.

Comment on lines +3021 to +3030
/* Inherit locus from subpath — Gather collects within the same segment,
* data distribution across segments doesn't change. */
pathnode->path.locus = subpath->locus;
pathnode->path.locus.parallel_workers = 0; /* Gather output is single-stream */

pathnode->path.motionHazard = subpath->motionHazard;
pathnode->path.barrierHazard = subpath->barrierHazard;
pathnode->path.rescannable = false;
pathnode->path.sameslice_relids = subpath->sameslice_relids;


@avamingli avamingli Feb 11, 2026


create_gather_path was guarded by Assert(false) for a reason — the locus semantics of Gather in an MPP context were never fully defined or implemented.

The comment is wrong, and the conclusion is wrong.
Gather does change the data distribution.
Simplest example: a Hash-distributed table's parallel partial scan has locus HashWorkers (data is hash-distributed across segments, and within each segment it's split across workers).
Gather collects all workers' data back to the leader process, so the locus should become Hash. You can't just copy the subpath locus and zero out parallel_workers — that's incorrect for HashWorkers, SegmentGeneral, and possibly other locus types as well.
Also, this change is global — create_gather_path isn't FDW-specific. Once the Assert is gone, every code path that calls this function is affected. The hardest case is JOINs — mixing Gather with CBDB-style parallelism (Motion/slice) introduces a ton of problems. This PR doesn't seem to have considered any of that.

@MisterRaindrop
Contributor Author

Overall, FDW parallel scan is a direction worth exploring, but this approach is too rough. The core problems are:

  1. locus transition semantics for Gather in an MPP context haven't been thought through, and the changes are too broad.
  2. FDW is a black box from the database's perspective.
    For heap tables we have parallel scan (divide work by pages), for AO/AOCS we have parallel scan (divide work by files) — the work partitioning is well-defined.
    But for FDWs, the parallel behavior depends entirely on the FDW's own implementation. If an FDW (say file_fdw) sets parallel_safe = true following planner's parallel logic but doesn't actually implement the DSM parallel callbacks (EstimateDSMForeignScan, InitializeDSMForeignScan, InitializeWorkerForeignScan), then multiple workers will each scan the full dataset, producing duplicate rows.

I'm not very familiar with Cloudberry. Still learning.

FDW itself is a black box; its behavior largely depends on how the user implements it. My understanding is that users need to take responsibility for their own implementations. Additionally, I could enable Gather only for FDWs and keep it disabled in all other cases; would that preserve the parallel-processing advantages of PostgreSQL?

Additionally, I've looked into other aspects of FDW parallelism. Currently, it seems there is no optimal solution.

So, should we aim to implement parallelism that is transparent to users? Or are there better approaches? Could you share some idea?

@avamingli
Contributor

Overall, FDW parallel scan is a direction worth exploring, but this approach is too rough. The core problems are:

  1. locus transition semantics for Gather in an MPP context haven't been thought through, and the changes are too broad.
  2. FDW is a black box from the database's perspective.
    For heap tables we have parallel scan (divide work by pages), for AO/AOCS we have parallel scan (divide work by files) — the work partitioning is well-defined.
    But for FDWs, the parallel behavior depends entirely on the FDW's own implementation. If an FDW (say file_fdw) sets parallel_safe = true following planner's parallel logic but doesn't actually implement the DSM parallel callbacks (EstimateDSMForeignScan, InitializeDSMForeignScan, InitializeWorkerForeignScan), then multiple workers will each scan the full dataset, producing duplicate rows.

I'm not very familiar with Cloudberry. Still learning.

FDW itself is a black box; its behavior largely depends on how the user implements it. My understanding is that users need to take responsibility for their own implementations. Additionally, I could enable Gather only for FDWs and keep it disabled in all other cases; would that preserve the parallel-processing advantages of PostgreSQL?

Additionally, I've looked into other aspects of FDW parallelism. Currently, it seems there is no optimal solution.

So, should we aim to implement parallelism that is transparent to users? Or are there better approaches? Could you share some idea?

Neither PostgreSQL nor Cloudberry supports parallel FDW scans; that's a deliberate decision, not an oversight.

On the implementation side: having the kernel generate partial paths for FDW will cause FDWs that don't implement parallel scan callbacks to silently produce wrong results (e.g. duplicate rows). That's a kernel bug, not a user error — we can't shift that responsibility to FDW authors. And mixing Gather with CBDB-style parallelism remains fundamentally broken — the locus handling is wrong, and none of the issues I raised (joins, locus transitions, the overly broad execMain.c change) have been addressed.

More importantly, before discussing how, we need to answer why. What real-world problem does this solve in an MPP system where FDW is already used across segments? And given the risks I mentioned above — broken locus transitions, silent wrong results for existing FDWs, untested join/subquery interactions — even if it can be done, is it worth the complexity? If you want to push this forward, you need to make the case clearly: what's the motivation, and convince us that all the issues raised have sound solutions.

@MisterRaindrop
Contributor Author

Overall, FDW parallel scan is a direction worth exploring, but this approach is too rough. The core problems are:

  1. locus transition semantics for Gather in an MPP context haven't been thought through, and the changes are too broad.
  2. FDW is a black box from the database's perspective.
    For heap tables we have parallel scan (divide work by pages), for AO/AOCS we have parallel scan (divide work by files) — the work partitioning is well-defined.
    But for FDWs, the parallel behavior depends entirely on the FDW's own implementation. If an FDW (say file_fdw) sets parallel_safe = true following planner's parallel logic but doesn't actually implement the DSM parallel callbacks (EstimateDSMForeignScan, InitializeDSMForeignScan, InitializeWorkerForeignScan), then multiple workers will each scan the full dataset, producing duplicate rows.

I'm not very familiar with Cloudberry. Still learning.
FDW itself is a black box; its behavior largely depends on how the user implements it. My understanding is that users need to take responsibility for their own implementations. Additionally, I could enable Gather only for FDWs and keep it disabled in all other cases; would that preserve the parallel-processing advantages of PostgreSQL?
Additionally, I've looked into other aspects of FDW parallelism. Currently, it seems there is no optimal solution.
So, should we aim to implement parallelism that is transparent to users? Or are there better approaches? Could you share some idea?

Neither PostgreSQL nor Cloudberry supports parallel FDW scans; that's a deliberate decision, not an oversight.

On the implementation side: having the kernel generate partial paths for FDW will cause FDWs that don't implement parallel scan callbacks to silently produce wrong results (e.g. duplicate rows). That's a kernel bug, not a user error — we can't shift that responsibility to FDW authors. And mixing Gather with CBDB-style parallelism remains fundamentally broken — the locus handling is wrong, and none of the issues I raised (joins, locus transitions, the overly broad execMain.c change) have been addressed.

More importantly, before discussing how, we need to answer why. What real-world problem does this solve in an MPP system where FDW is already used across segments? And given the risks I mentioned above — broken locus transitions, silent wrong results for existing FDWs, untested join/subquery interactions — even if it can be done, is it worth the complexity? If you want to push this forward, you need to make the case clearly: what's the motivation, and convince us that all the issues raised have sound solutions.

Parallel FDW scans primarily address slow data loading. This functionality was already implemented in earlier versions of PostgreSQL. Now I am attempting to bring this feature to an MPP system. In simple tests, parallelization delivered a 1–2x performance improvement, and such gains matter in performance-sensitive business scenarios, which is why I am working to introduce this functionality. Alternatively, we could discuss the implementation plan in the issue tracker.

@MisterRaindrop
Contributor Author

Thank you for the detailed review comments. Regarding the core issues raised (the correctness risk that kernel-generated partial paths pose for FDWs that do not implement the parallel callbacks, the mixing of Gather with CBDB's gang model, locus transitions, and the scope of the changes in execMain.c), I agree that these all need to be addressed seriously.

After reconsideration, I am inclined to withdraw the kernel-side modifications and adopt a pure FDW-layer solution instead. The core idea is:

  1. Do not modify the kernel's partial path generation, execMain.c, or locus logic—avoiding all the risks mentioned above.
  2. The FDW directly uses CBDB's existing parallel variables (ParallelWorkerNumberOfSlice / TotalParallelWorkerNumberOfSlice) to obtain the current worker number and total count.
  3. During execution, the FDW calculates the virtual segment ID based on these two values, modifies the HTTP header sent to PXF, and allows PXF's round-robin sharding mechanism to automatically distribute data evenly among all gang workers.

This solution requires no kernel modifications and will not affect other FDWs.

I would like to confirm: Is this direction reasonable? Are the variables ParallelWorkerNumberOfSlice and TotalParallelWorkerNumberOfSlice stable and reliable under the current CBDB parallel framework? Or do you have a more recommended way for the FDW to perceive gang parallel information?

@MisterRaindrop
Contributor Author

Among the existing parallel frameworks in CBDB, after FDW registers a partial path via add_partial_path(), can the planner correctly trigger gang expansion and set ParallelWorkerNumberOfSlice? Or does it require additional kernel adaptation?

@avamingli
Contributor

Parallel FDW primarily addresses the issue of slow data loading. This functionality was already implemented in earlier versions of PostgreSQL.

Where exactly? Are you referring to this commit?

@avamingli
Contributor

This solution requires no kernel modifications and will not affect other FDWs.

I would like to confirm: Is this direction reasonable?

That sounds more reasonable.

@avamingli
Contributor

Are the variables ParallelWorkerNumberOfSlice and TotalParallelWorkerNumberOfSlice stable and reliable under the current CBDB parallel framework?

Yes, they are stable and reliable under the current CBDB parallel framework. But I'm not sure how you plan to use them.

During execution, the FDW calculates the virtual segment ID based on these two values, modifies the HTTP header sent to PXF, and allows PXF's round-robin sharding mechanism to automatically distribute data evenly among all gang workers.

I'm not entirely sure I follow — isn't this essentially how MPP PXF works today? As for deriving the virtual segment ID from these two values: I'm not sure that's enough; off the top of my head, different slices on the same segment could have the same parallel worker numbers.

@MisterRaindrop
Contributor Author

Are the variables ParallelWorkerNumberOfSlice and TotalParallelWorkerNumberOfSlice stable and reliable under the current CBDB parallel framework?

Yes, they are stable and reliable under the current CBDB parallel framework. But I'm not sure how you plan to use them.

During execution, the FDW calculates the virtual segment ID based on these two values, modifies the HTTP header sent to PXF, and allows PXF's round-robin sharding mechanism to automatically distribute data evenly among all gang workers.

I'm not entirely sure I follow — isn't this essentially how MPP PXF works today? As for deriving the virtual segment ID from these two values: I'm not sure that's enough; off the top of my head, different slices on the same segment could have the same parallel worker numbers.

Yes, essentially, it reuses the existing MPP round-robin sharding mechanism of PXF—by modifying the segment ID/count in the HTTP header, PXF can distribute data to N×W gang workers instead of N physical segments. No changes are required on the PXF server side.

Regarding ParallelWorkerNumberOfSlice: From the assignment logic in parallel.c, workers on the same segment are assigned incrementally via DSM entry (0, 1, 2, ...), which should be unique. However, I want to confirm: In the CBDB parallel framework, is this value guaranteed to be unique within the same slice on the same segment?
